Classifying Molecular Sequences Using a Linkage Graph With Their Pairwise Similarities
Authors
Abstract
This paper presents a method for classifying a large, mixed set of uncharacterized sequences produced by genome projects. As the measure of sequence similarity, we use the similarity score computed by a dynamic programming (DP) method such as the Smith–Waterman local alignment algorithm. Although DP-based comparison is very sensitive, when the given sequences include a family whose members have diverged considerably during evolution, the similarity among some of them may be hidden behind spurious similarity to unrelated sequences. In addition, the distance derived from the similarity score may not be a metric (i.e., the triangle inequality may not hold) when some sequences have a multi-domain structure. To cope with these problems, we introduce a new graph structure, called a p-quasi complete graph, for describing a family of sequences with a confidence measure. We prove that a restricted version of the p-quasi complete graph problem (given a positive integer k, decide whether a graph contains a 0.5-quasi complete subgraph of size ≥ k) is NP-complete. We therefore present an approximation algorithm for classifying a set of sequences using p-quasi complete subgraphs. The effectiveness of our method is demonstrated by classifying over 4000 protein sequences from the recently completed Escherichia coli genome. © 1999 Elsevier Science B.V. All rights reserved.
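The abstract does not define a p-quasi complete graph, so the following is only a minimal sketch under an assumed definition: a vertex set of size n is p-quasi complete if every vertex is adjacent to at least p(n - 1) of the others. The sequence names, scores, threshold, and the helpers `linkage_graph` and `is_p_quasi_complete` are hypothetical illustrations, not the authors' algorithm.

```python
def linkage_graph(scores, threshold):
    """Keep an edge for every sequence pair whose similarity score
    reaches the threshold (hypothetical helper)."""
    return [(a, b) for (a, b), s in scores.items() if s >= threshold]


def is_p_quasi_complete(vertices, edges, p):
    """Assumed definition: every vertex in `vertices` is adjacent to at
    least p * (n - 1) other vertices of the set, where n = len(vertices)."""
    vs = set(vertices)
    n = len(vs)
    if n <= 1:
        return True
    adj = {v: set() for v in vs}
    for a, b in edges:
        # Only edges inside the candidate set count toward the degree.
        if a in vs and b in vs and a != b:
            adj[a].add(b)
            adj[b].add(a)
    need = p * (n - 1)
    return all(len(adj[v]) >= need for v in vs)


# Made-up pairwise scores: A and C are diverged members of one family,
# so their direct score falls below the threshold.
scores = {
    ("A", "B"): 120, ("B", "C"): 110, ("A", "C"): 40,
    ("C", "D"): 90, ("A", "D"): 30, ("B", "D"): 95,
}
edges = linkage_graph(scores, threshold=80)

family = ["A", "B", "C"]
print(is_p_quasi_complete(family, edges, p=0.5))  # relaxed requirement
print(is_p_quasi_complete(family, edges, p=1.0))  # p = 1 demands a clique
```

Under this assumed definition, lowering p below 1 tolerates missing edges such as A–C, which is one way to read how the structure accommodates diverged family members.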
Related articles
A graph-based clustering method for a large set of sequences using a graph partitioning algorithm.
A graph-based clustering method is proposed to cluster protein sequences into families, which automatically improves on the clusters produced by the conventional single linkage clustering method. Our approach formulates the sequence clustering problem as a kind of graph partitioning problem in a weighted linkage graph, in which vertices correspond to sequences and edges correspond to similarities higher than a given thres...
Image Categorization Using Directed Graphs
Most existing graph-based semi-supervised classification methods use pairwise similarities as edge weights of an undirected graph with images as the nodes of the graph. Recently, however, several new graph construction methods produce directed graphs (asymmetric similarity between nodes). A simple symmetrization is often used to convert a directed graph to an undirected one. This, however, loses...
Malware Detection using Classification of Variable-Length Sequences
In this paper, a novel graph-based method is proposed for classifying variable-length sequences as a form of feature extraction. The proposed method overcomes the problems that traditional graphs have with variable-length data, without fixing the length of sequences, by determining the most frequent instructions and placing the remaining instructions in an “other” set, which saves time and memory. Acco...
Detection of Distant Structural Similarities in a Set of Proteins Using a Fast Graph-Based Method
We introduce a method for finding weak structural similarities in a set of protein structures. Proteins are considered at their secondary structure level. The method uses a rigorous graph-theoretical algorithm which finds all structural similarities. Protein structures are modelled as undirected labelled graphs, the so-called protein graphs. We suggest that for detecting the similarities betwee...
Biogeometry Research Faster Multiple Sequence Alignment Algorithms Based on Pairwise Segmentation
Multiple Sequence Alignment (MSA) is a central problem in computational molecular biology: it identifies and quantifies similarities among several protein or DNA sequences. The well-known dynamic programming (DP) algorithms align k sequences (each of length n) by constructing a k-dimensional grid graph of size O(n^k), with each of the sequences enumerating one of the dimensions of the grid. The ...
Journal title:
- Theor. Comput. Sci.
Volume 210
Publication date: 1999